Activation (= transfer) functions are the non-linear operations in neural networks. We are looking at frameworks for learning these transfer functions:
Consider two PDFs \(f\) and \(g\), with CDFs denoted \(F\) (with \(dF(x)=f(x)\ dx\)) and \(G\) (with \(dG(x)=g(x)\ dx\)) respectively, for random variables \(x\) and \(y\).
Problem: How to find a (bijective and differentiable) transformation \(\phi\) such that \(y=\phi(x)\) for two random variables \(x\sim f(x)\) and \(y \sim g(y)\)?
\[ \text{Solution (1D):}\qquad \phi(x)=G^{-1}\circ F(x) \]
This solution has also been shown to minimize the quadratic Wasserstein transport cost:
\[ \mathcal{W} = \mathbb{E}_{(x,y)\sim\pi} \left[ |x-y|^2 \right] \quad \text{s.t.}\quad f=\int \pi\ dy \quad \text{and}\quad g=\int \pi\ dx \]
Example (equalization): \(y=\phi(x)=Id\circ F(x)\), using \(g\equiv \mathcal{U}\) (uniform) so that \(G=Id=G^{-1}\) (identity function).
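As an illustration, the 1D map \(\phi=G^{-1}\circ F\) can be estimated from samples by composing the empirical CDF of the source with the empirical quantile function of the target. A minimal sketch (the function name, distributions and sample sizes are illustrative, not from the source):

```python
import numpy as np

def ot_map_1d(x_samples, y_samples, x):
    """Monotone 1D transport map phi = G^{-1} o F, estimated from samples.

    F is approximated by the empirical CDF of x_samples, G^{-1} by the
    empirical quantile function of y_samples.
    """
    # Empirical CDF value F(x): fraction of x_samples <= x
    u = np.searchsorted(np.sort(x_samples), x, side="right") / len(x_samples)
    # Empirical quantile G^{-1}(u) of the target samples
    return np.quantile(y_samples, np.clip(u, 0.0, 1.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10000)     # x ~ f (standard normal)
y = rng.normal(5.0, 2.0, 10000)     # y ~ g (shifted, scaled normal)
mapped = ot_map_1d(x, y, x)         # mapped samples should follow g
print(mapped.mean(), mapped.std())  # close to 5 and 2
```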
Example (bridge/synthesis): \(y=\phi(x)=G^{-1}\circ Id(x)\) synthesizes samples \(y\sim g\) from easily generated samples \(x\sim \mathcal{U}\).
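This bridge construction is exactly inverse-CDF sampling. A sketch for a target \(g\) with a closed-form \(G^{-1}\) (the exponential distribution is an assumed example, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target g: exponential with rate 1, so G(y) = 1 - exp(-y) and
# G^{-1}(u) = -log(1 - u).  x ~ Uniform(0,1) plays the role of F(x) = Id.
u = rng.uniform(size=100000)
y = -np.log(1.0 - u)   # y = G^{-1}(u): samples from g
print(y.mean())        # exponential(1) has mean 1
```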
The solution \(\phi(x)=G^{-1}\circ F(x)\) works for r.v. \(x,y\in \mathbb{R}\), but what if \(\mathbf{x},\mathbf{y}\in\mathbb{R}^d\)?
The Iterative Distribution Transfer (IDT) algorithm proposed by Pitié et al. applies the 1D solution iteratively, by projecting the multidimensional data \(x\) and \(y\) onto 1D spaces.
The projection directions are chosen orthogonal, spanning \(\mathbb{R}^d\), potentially with redundancy; to vary these directions from one iteration to the next, a random rotation is used.
IDT has been applied to colour pixels (\(d=3\)).
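A compact sketch of the IDT iteration, under the assumption of equal sample sizes for both point clouds (rank matching then implements the 1D map \(G^{-1}\circ F\) along each axis; drawing the random rotation via a QR decomposition is an implementation choice, not from the source):

```python
import numpy as np

def idt(x, y, n_iter=30, seed=0):
    """Sketch of Iterative Distribution Transfer.

    Each iteration: draw a random rotation, project both clouds onto its
    axes, apply the 1D map along each axis, rotate back.  Assumes
    x.shape[0] == y.shape[0] so rank matching can be used.
    """
    x = x.copy()
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    for _ in range(n_iter):
        # Random rotation: orthonormal basis from a QR decomposition
        q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        xp, yp = x @ q, y @ q            # project onto the rotated axes
        for j in range(d):
            # 1D optimal map along axis j: match sorted order (ranks)
            order = np.argsort(xp[:, j])
            xp[order, j] = np.sort(yp[:, j])
        x = xp @ q.T                     # rotate back
    return x

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(2000, 3))   # source cloud, d = 3
y = rng.normal(3.0, 0.5, size=(2000, 3))   # target cloud
z = idt(x, y)
print(z.mean(axis=0), z.std(axis=0))       # should approach the target stats
```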
[Figure: colour transfer with IDT. Panels: \(x\in\mathbb{R}^3\), \(y\in\mathbb{R}^3\), \(\phi_{IDT}(x)\in\mathbb{R}^3\)]
IDT has a similar structure to a DNN: a linear projection followed by a non-linear mapping, repeated iteratively.
Using the taxonomy of Zhang et al. (2021):

Optimal Transport Methods
├── Regularization based
│   └── ...
└── Projection based
    ├── Random Projection
    │   └── Iterative Distribution Transfer (IDT)
    └── Projection Pursuit
The solution \(\phi_{OT}(x)=G^{-1}\circ F(x)\) may not capture well the actual process (i.e. the joint distribution \(\pi(x,y)\)) that generated the data \(\lbrace (x^{(i)},y^{(i)})\rbrace_{i=1,\cdots,n}\).
The solution \(\phi_{OT}\) (green curve) only sees the datasets \(\lbrace x^{(i)}\rbrace\) and \(\lbrace y^{(i)}\rbrace\) (blue points), without any knowledge of the correspondences \(\lbrace (x^{(i)},y^{(i)})\rbrace\) (black points). The solution \(\phi_{OT}\) is by definition bijective (hence monotone).
An alternative is the Nadaraya-Watson (NW) kernel regression estimate, which uses the correspondences directly:
\[ \hat{y}=\phi_{NW}(x)=\mathbb{E}[y|x] =\int y \ p_{y|x}(y|x) dy \simeq \frac{\sum_{i=1}^n y^{(i)} \ k_h(x-x^{(i)})}{\sum_{i=1}^n k_h(x-x^{(i)})} \]
The solution \(\phi_{NW}\) (blue curve) takes advantage of the correspondences \(\lbrace (x^{(i)},y^{(i)})\rbrace\) (black points) that may be available. \(\phi_{NW}\) is not monotone, and its smoothness is controlled by the bandwidth \(h\) of the kernel.
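A minimal sketch of the NW estimate with a Gaussian kernel (the kernel choice, bandwidth value and test function are assumptions for illustration):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h=0.3):
    """Nadaraya-Watson estimate phi_NW(x) = E[y|x], Gaussian kernel k_h.

    The bandwidth h controls the smoothness of the estimate.
    """
    # Pairwise kernel weights k_h(x - x_i), shape (n_query, n_train)
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 500)
y = np.sin(x) + 0.1 * rng.normal(size=500)   # noisy correspondences (x_i, y_i)
xq = np.array([0.0, np.pi / 2])
est = nadaraya_watson(x, y, xq)
print(est)                                   # roughly sin(xq) = [0, 1]
```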
We have recently proposed to fit a spline function, with correspondences (\(\phi_{L2}^{Corr}\)) and without (\(\phi_{L2}\)), that minimizes the robust \(L2\) distance between the GMMs \(f\) and \(g\).
In DNNs, the machine model corresponds to a function \(\mathcal{M}_{\theta}\) designed by composition of linear and non-linear functions:
\[ \mathcal{M}_{\theta}(x)= \mathrm{W}_L \circ \cdots \underbrace{\phi_l \circ \mathrm{W}_l}_{\text{Layer } l} \circ \cdots \circ \phi_1 \circ \mathrm{W}_1 (x) \]
Layer \(l\): a linear transformation followed by a point-wise non-linear activation
\[ \phi_l(x)=\left(\phi_{l,1}(x_1),\cdots,\phi_{l,N_l}(x_{N_l}) \right) \]
where the scalar map \(\phi_{l,n}:\mathbb{R} \rightarrow \mathbb{R}\) is the activation or transfer function of the neuron indexed by \((l, n)\).
In a conventional setup, the activation function is fixed: \[ \phi_{l,n}(x)=\phi(x-b_{l,n}) \] e.g. with \[ \begin{array}{lc} \text{softplus: }&\phi(x)=\log\left(1+\exp(x)\right)\\ \text{ReLU: }&\phi(x)=\max(0,x)=x^+\\ \end{array} \]
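For concreteness, the two fixed activations above with the per-neuron shift \(b_{l,n}\) (the numeric values are illustrative):

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + exp(x)); logaddexp(0, x) avoids overflow
    return np.logaddexp(0.0, x)

def relu(x):
    # ReLU(x) = max(0, x) = x^+
    return np.maximum(0.0, x)

# Conventional setup: a fixed profile shifted by the neuron's bias b_{l,n}
b_ln = 0.5
x = np.array([-2.0, 0.0, 2.0])
print(relu(x - b_ln))       # zero below the bias, linear above: 0, 0, 1.5
print(softplus(x - b_ln))   # smooth approximation of the same profile
```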
Bohra et al. (2020) proposed DNNs with spline-based activation (transfer) functions. The DNN is optimized end-to-end.
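A sketch of such a learnable activation: a linear B-spline on a uniform knot grid whose coefficients would be the trainable parameters (knot placement, initialisation and the absence of a training loop are simplifications relative to Bohra et al.):

```python
import numpy as np

class LinearSplineActivation:
    """Learnable activation as a linear spline on a uniform knot grid.

    The coefficients c (values at the knots) are the trainable
    parameters; here they are only initialised, not trained.
    """
    def __init__(self, t_min=-3.0, t_max=3.0, n_knots=13):
        self.knots = np.linspace(t_min, t_max, n_knots)
        self.c = np.maximum(0.0, self.knots)   # ReLU initialisation

    def __call__(self, x):
        # Piecewise-linear interpolation of the knot coefficients;
        # np.interp extends by constant values outside the grid
        return np.interp(x, self.knots, self.c)

phi = LinearSplineActivation()
x = np.array([-1.0, 0.25, 2.0])
print(phi(x))   # ReLU at initialisation: 0, 0.25, 2
```

At initialisation the spline reproduces ReLU; training would then adjust the coefficients jointly with the network weights.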
Is learning the directions of projection needed? (cf. random projections in IDT, and the wavelets chosen in scattering networks)
Is end-to-end optimization needed for learning activation functions?
Can (cost-effective and accurate) DNNs be designed using pre-set projections + learnt activation functions?
…